Reject unresolvable unit symbols instead of dropping them by nertzy · Pull Request #30 · minad/unit

nertzy · 2026-06-12T19:12:55Z

Summary

The lexer built its symbol character class only from the glyphs of registered symbols, so any other non-ASCII glyph matched no token and String#scan silently skipped it. Unknown ASCII words still matched the baseline [A-Za-z0-9_]+ class and reached validation, but unknown glyphs vanished:
- Unit(1, 'λ') returned a unitless 1
- Unit(1, 'm λ') silently became 1 m
- Unit(1, 'm/λ') raised a misleading SyntaxError: Unexpected token /
Now the lexer tokenizes any run of non-operator, non-whitespace characters as a symbol, so every unrecognized glyph survives lexing and fails loudly as Undefined unit <symbol> through the same validation path unknown ASCII symbols already used — and the error names the real culprit (Undefined unit λ) rather than a neighboring operator.
As a side effect the tokenizer no longer depends on what is registered, so it is a plain constant again instead of a per-load rebuild. Registered glyphs (Ω, °C, µ, %, units loaded at runtime) still parse, and to_s round-trips are preserved.

Breaking: parsing an unrecognized unit symbol now always raises TypeError: Undefined unit … instead of sometimes silently yielding a dimensionless value. This extends the contract that unknown ASCII symbols already followed to unknown glyphs of any alphabet.

Test plan

bundle exec rake — both phases green (140 no-DSL + 142 DSL examples, 0 failures)
New specs cover: unknown ASCII symbol; unregistered glyph alone; glyph implicitly multiplied with a known unit; glyph adjacent to an operator; and a unit whose defining system has not been loaded yet (raises until loaded, then resolves)

The lexer built its symbol character-class only from registered glyphs, so any other non-ASCII glyph matched no token and String#scan silently skipped it. Unknown ASCII words still matched the baseline class and reached validation, but unknown glyphs vanished: Unit(1, 'λ') returned a unitless 1, Unit(1, 'm λ') silently became 1 m, and Unit(1, 'm/λ') raised a misleading 'Unexpected token /'. Tokenize any run of non-operator, non-whitespace characters as a symbol so every unrecognized glyph survives lexing and fails loudly as 'Undefined unit <symbol>' via the same validation path unknown ASCII symbols already used, naming the real culprit. This also makes the tokenizer independent of what is registered, so it is a plain constant again rather than a per-load rebuild. Add specs covering unknown ASCII symbols, unregistered glyphs (alone, implicitly multiplied, and operator-adjacent), and a unit whose defining system has not been loaded yet.

nertzy merged commit 63e438e into minad:master Jun 12, 2026
4 checks passed

nertzy deleted the strict-lexer branch June 12, 2026 19:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reject unresolvable unit symbols instead of dropping them#30

Reject unresolvable unit symbols instead of dropping them#30
nertzy merged 1 commit into
minad:masterfrom
nertzy:strict-lexer

nertzy commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nertzy commented Jun 12, 2026

Summary

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant